B-Tree Basics
B-Trees are one of the most popular storage data structures for Storage Engines and Databases in the wild. This is due to their support for logarithmic search over huge datasets - a search over 4 billion entries needs only about 32 comparisons - which minimizes the amount of Disk IO we need to fetch a certain piece of data.
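As a rough back-of-the-envelope sketch (the 4 billion figure and the fanouts below are illustrative, not tied to any particular engine), the tree height determines how many pages have to be read:

```python
import math

entries = 4_000_000_000  # ~4 billion keys, illustrative

# Binary search over a flat sorted array: ~log2(N) comparisons.
print(math.ceil(math.log2(entries)))                      # 32

# A B-Tree with high fanout touches only ~log_fanout(N) pages.
for fanout in (100, 500, 1000):
    print(fanout, math.ceil(math.log(entries, fanout)))   # 5, 4, 4
```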
Structure
A B-Tree is a sorted data structure belonging to the Tree family - specifically a balanced tree variant - that consists of a Root Node, Internal Nodes and Leaf Nodes.
Each Internal Node contains N Keys (a.k.a. index entries, separator keys or divider cells) and N+1 Pointers, each pointer referencing a child node (the next Internal Node or a Leaf Node).
Leaf Nodes should always be at the same depth in the B-Tree, which keeps the tree balanced.
The tree is divided by the Index Entries (Separator Keys / Divider Cells). These keys split the tree into smaller subtrees, each holding a certain subrange of keys (e.g. the subtree between separators K1 and K2 holds every key K with K1 <= K < K2).
B-Trees are organized into pages (2 KB to 16 KB, or whatever page size the OS & storage combination uses). The term Node is interchangeable with Page.
Occupancy is the ratio between the number of keys a Node holds and its capacity; thanks to the high fanout, B-Tree nodes tend to stay well occupied.
A B-Tree delivers two essential properties:
- High Fanout: the tree grows horizontally, i.e. by adding more internal nodes and splitting/merging only when necessary.
- Low Height: even very large trees stay only a few levels deep (root, internal and leaf).
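A minimal sketch of the node layout described above, assuming an in-memory representation with parallel lists (the names `Node`, `keys`, `children` and `values` are illustrative, not from any particular engine):

```python
from dataclasses import dataclass, field

ORDER = 4  # max keys per node; real engines size nodes to fill a disk page

@dataclass
class Node:
    is_leaf: bool
    keys: list = field(default_factory=list)      # N sorted separator keys
    children: list = field(default_factory=list)  # N+1 child pointers (internal nodes)
    values: list = field(default_factory=list)    # values stored in leaf nodes
```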
Algorithms
B-Tree Insert Algorithm
- Perform a B-Tree lookup to find the target Leaf Node and locate the insertion point.
- Insert the key and the value at that insertion point in the leaf node (see the sketch after this list). Updates only change the value.
- If the leaf node overflows, i.e. it exceeds its maximum capacity, we split the leaf node.
- Split condition: for leaf nodes, when N+1 keys are reached; for Internal Nodes, when (N+1)+1 pointers are reached.
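A minimal sketch of the leaf-level insert/update step under those rules, assuming a leaf represented as parallel sorted `keys` and `values` lists (the helper name `leaf_insert` is mine):

```python
import bisect

def leaf_insert(keys, values, key, value):
    """Insert (or update) key/value in a sorted leaf; split handling is separate."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value            # update only changes the value
    else:
        keys.insert(i, key)          # place key/value at the insertion point
        values.insert(i, value)

keys, values = [10, 20, 40], ["a", "b", "d"]
leaf_insert(keys, values, 30, "c")
print(keys, values)   # [10, 20, 30, 40] ['a', 'b', 'c', 'd']
```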
B-Tree Insert: Splits and Promotion
A split is done by allocating a new node and transferring half of the keys and values into it, then adding the new node's first key and a pointer to it into the parent node. With the split point being the index at which we divide the data, everything from split point + 1 onwards is transferred into the new node. If needed, the split propagates up the tree.
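A rough sketch of a leaf split along those lines, again using parallel key/value lists (whether the split-point entry itself stays in the left node or moves right varies between leaf and internal nodes and between implementations; this version keeps the left half and promotes the first key of the new right node):

```python
def split_leaf(keys, values):
    """Split an overflowing leaf; return (right_keys, right_values, promoted_key)."""
    split_point = len(keys) // 2
    right_keys, right_values = keys[split_point:], values[split_point:]
    del keys[split_point:]           # left node keeps the first half
    del values[split_point:]
    # The first key of the new (right) node is promoted into the parent,
    # together with a pointer to the new node.
    return right_keys, right_values, right_keys[0]

keys, values = [10, 20, 30, 40, 50], ["a", "b", "c", "d", "e"]
right_k, right_v, promoted = split_leaf(keys, values)
print(keys, right_k, promoted)   # [10, 20] [30, 40, 50] 30
```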
B-Tree Lookup Algorithm
- Start from the root node.
- Perform a binary search comparing the searched key against the keys stored in the Internal Node.
- Find the first separator key greater than the searched key.
- Follow the child pointer (subtree) associated with that separator key.
- Repeat the process at each Internal Node until you reach a leaf node (a compact sketch follows this list).
- If the leaf holds no key equal to the searched key, the key is not in the B-Tree.
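A self-contained sketch of that descent on a small illustrative two-level tree (the `Node` shape mirrors the structure described earlier; the names are mine):

```python
import bisect
from collections import namedtuple

# Illustrative node shape: sorted keys, child pointers (internal) or values (leaf).
Node = namedtuple("Node", "is_leaf keys children values")

def lookup(node, key):
    """Descend from the root to a leaf, then binary-search the leaf for the key."""
    while not node.is_leaf:
        # The first separator key greater than `key` picks the child subtree.
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None  # no equal key in the leaf: the key is not in the B-Tree

leaf_a = Node(True, [5, 10], None, ["a", "b"])
leaf_b = Node(True, [20, 30], None, ["c", "d"])
root = Node(False, [20], [leaf_a, leaf_b], None)
print(lookup(root, 10), lookup(root, 25))   # b None
```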
B-Tree and Disk Drives:
HDD: The most expensive operation for an HDD is a disk seek, which consists of mechanically moving the drive head to a certain position so it can read certain sectors. After the head is positioned, reading is very cheap. Each sector typically holds 512 bytes to 4 KB.
SSD: SSDs are structured hierarchically out of small segments: 32/64 memory cells form a string, strings form arrays, arrays form pages (2 KB to 16 KB), pages form blocks, blocks form planes, and planes form a die. A Block usually has 64 to 512 pages. The smallest addressable unit of an SSD is a page (2 KB to 16 KB). HOWEVER, writes can only go to empty (erased) pages - i.e. we cannot update data in place - and pages can only be emptied by erasing the whole block they belong to (a block erase).
Unlike HDDs, whose drivers have to position the mechanical head to seek for data, an SSD has a controller running the Flash Translation Layer (FTL) that maps page ids to their physical locations and manages the entire device.
Responsibilities of the FTL (a toy sketch follows this list):
- Garbage collection
- Mapping page ids to physical locations on the device
- Tracking page state: empty, written and discarded pages
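A toy sketch of the mapping/remapping idea (the page states and names here are illustrative and nothing like a real controller's implementation):

```python
# Logical page id -> physical location; a rewrite goes to a fresh physical
# page and the old one is only marked as discarded (erase happens per block).
mapping = {}             # logical page id -> physical page id
state = {}               # physical page id -> "empty" | "written" | "discarded"
free_pages = list(range(8))

def write_page(logical_id, data):
    physical = free_pages.pop(0)
    if logical_id in mapping:        # out-of-place update
        state[mapping[logical_id]] = "discarded"
    mapping[logical_id] = physical
    state[physical] = "written"
    # ... store `data` at `physical` ...

write_page(3, b"hello")
write_page(3, b"world")
print(mapping, state)   # {3: 1} {0: 'discarded', 1: 'written'}
```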
Differences between SSD and HDD
Nowadays OSes abstract interactions with storage through the block device abstraction and let us access data in chunks. Even if we ask for a small piece of data, at least one block is read from storage.
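A small sketch of what block-granular access implies for a read (the 4096-byte block size is an assumption; the real size depends on the device and filesystem):

```python
BLOCK_SIZE = 4096  # assumed block size

def blocks_touched(offset, length):
    """How many whole blocks the device must read to serve this request."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return last - first + 1

print(blocks_touched(10, 16))     # 1 -> even a 16-byte read costs a full block
print(blocks_touched(4090, 16))   # 2 -> the read straddles a block boundary
```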
The main difference is that HDDs remain a good fit when your workload consists of sequential IO, which benefits from the HDD's cheap read operations after a seek. Sequential IO can be described as doing any kind of IO (reads or writes) with sequential locality - i.e. at the next offset - which prevents expensive disk seeks.